1 Introduction

With platforms offering cheap flights and accommodation, travelling around in Europe is common among students nowadays. Sharing economy services such as Airbnb have facilitated the search for a spare room rented out by a private agent. Profits of the platform have skyrocketed in the past years and a typical UK host now earns around £3,000 a year (Cox, 2017a). As those profits are paid by the user, understanding the pricing of Airbnb offers becomes crucial.

If you are on a tight budget you want to get the best value for your money. At the same time, you have certain expectations about the type of flat, its location and the offered amenities. This paper’s aim is to explore the various factors influencing the price and to set up a regression model which explains the pricing on Airbnb, so you can find the right price for the right flat.

2 Description of the dataset

The dataset of this paper covers all Airbnb offerings in London as of the 4th and 5th of March 2017. It contains 53,904 observations for 95 different variables. Its source is the website “Inside Airbnb - Adding data to the debate” (Cox, 2017b). This is an independent and non-commercial project aiming to examine the effect of Airbnb activities on urban development.

To focus the investigation, the dataset was narrowed down: only private rooms with at least three valid ratings were included. The resulting dataset has 7,020 observations for 78 variables. All data, modification code and documents can be accessed via the group's GitHub repository.

2.1 Price

Table 1: Descriptives of the Price
Min Q1 Median Mean Q3 Max
8 35 45 50 59 590

A room in London costs on average £50 per night. The summary statistics show that 75 percent of all Airbnbs are priced at £59 per night or less. However, there are some severe outliers that range up to a maximum of £590.

This raises concerns about the normality of the price distribution. Indeed, the density plot in Figure 1 shows that the raw price is strongly right-skewed rather than normal. To bring the distribution closer to normal, the price is transformed with the natural logarithm.

Figure 1: Density of Price and ln(Price)
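The effect of the logarithmic transformation on a right-skewed variable can be illustrated with a minimal sketch. The prices below are hypothetical values chosen to resemble the summary statistics in Table 1; they are not taken from the dataset, and the helper `skewness` is our own illustration rather than the code used for the paper.

```python
import math
import statistics

def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical nightly prices in GBP (assumption: illustrative only).
prices = [8, 30, 35, 40, 45, 45, 50, 59, 75, 120, 590]
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skewness of price:     {skew_raw:.2f}")
print(f"skewness of ln(price): {skew_log:.2f}")
```

The skewness of the log-transformed values is markedly smaller. The transformation does not guarantee normality, but it pulls in the long right tail.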

2.2 Rent

With London being one of the most expensive cities to live in, rent is a major cost of being a host on Airbnb. Rents are also an indicator of the attractiveness of a neighbourhood. Therefore, the impact of the underlying rent on the Airbnb price has to be accounted for. The initial dataset holds no information on the regular rent at the location of an Airbnb. Fortunately, the website “Find Properly” (Lokku Ltd., 2017) uses data from Zoopla to provide rent and selling prices for each London area by postcode. The average weekly rent for one-bedroom properties was merged with the Airbnb dataset, matching on the outward code, i.e. the first part of the postcode.
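The matching on the outward code can be sketched as follows. The field names (`zipcode`, `mean_rent`), the rent figures and the helper `outward_code` are assumptions made for illustration; the actual merge may have been implemented differently.

```python
def outward_code(postcode):
    """Return the outward code (the part before the space) of a UK postcode."""
    cleaned = postcode.strip().upper()
    if " " in cleaned:
        return cleaned.split()[0]
    # Without a space, the inward code is always the last three characters.
    # Strings of four characters or fewer are treated as outward codes already.
    return cleaned[:-3] if len(cleaned) > 4 else cleaned

# Hypothetical average weekly rents for one-bedroom properties, in GBP.
weekly_rent_by_outward = {"W1D": 520.0, "E1": 370.0}

# Hypothetical listings; the third one has no matching rent entry.
listings = [
    {"id": 1, "zipcode": "W1D 3QF"},
    {"id": 2, "zipcode": "e1 6an"},
    {"id": 3, "zipcode": "ZZ99 9ZZ"},
]

for listing in listings:
    code = outward_code(listing["zipcode"])
    # Listings without a match receive None and can be inspected separately.
    listing["mean_rent"] = weekly_rent_by_outward.get(code)

print(listings)
```

Normalising case and whitespace before the lookup matters because postcodes entered by hosts are free text.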

Geographically mapping the mean rent and the logarithmically transformed Airbnb price reveals the positive correlation (+0.48) between the two variables. Nevertheless, it also becomes clear that there is more to an Airbnb price than just the average rent in the particular neighbourhood.

Figure 2: Mapping Rent Prices vs. Airbnb Prices

2.3 Location

When choosing an Airbnb in London, many guests prefer to stay close to the city centre. Distance is defined here as the distance to the touristic city centre, Piccadilly Circus. It was calculated using the Haversine formula (Reid, 2011) and the geographic coordinates of Piccadilly Circus (Longitude: -0.133869, Latitude: 51.510067) (Latlong.net, 2017). The correlation between distance and the logarithmic Airbnb price is negative and weak (-0.39): the closer the property is to the city centre, the higher the price tends to be. Looking at the different distance bins, most of the positive outliers are located in the first three bins, i.e. high-end rooms are situated close to Piccadilly Circus. The price range also shrinks the further the flat is from the city centre.
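For reference, the Haversine distance to Piccadilly Circus can be computed as in the sketch below. The Earth radius of 6,371 km and the function name `haversine_km` are assumptions; the source does not state which radius was used, and the example listing coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption, not stated in the paper)

# Coordinates of Piccadilly Circus as given in the text.
PICCADILLY_LAT = 51.510067
PICCADILLY_LON = -0.133869

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical listing location, for illustration only.
listing_lat, listing_lon = 51.5455, -0.1620
distance = haversine_km(listing_lat, listing_lon, PICCADILLY_LAT, PICCADILLY_LON)
print(f"distance to Piccadilly Circus: {distance:.2f} km")
```

Because London spans only a few tens of kilometres, the spherical approximation behind the formula is more than accurate enough for this purpose.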

Figure 3: Rent Prices vs. Airbnb Prices in London

2.4 Reviews

Reviews could be a useful indicator of various characteristics of the room advertised. In addition to the written reviews, guests can give their hosts star ratings on the following parameters (Airbnb Inc., 2017b): overall experience, accuracy, cleanliness, communication, check in, location and value. Most of those are self-explanatory; accuracy represents the extent to which the online listing represents reality, and value is a subjective measure of whether the room was worth the price paid.

The guest ratings are translated into a score out of 10 for the individual categories, and a score out of 100 for the overall score. The mean value for many categories is 9 or 10. Such high scores are frequently seen when feedback from users is collected. For example, Uber considers removing drivers rated on average less than 4.6 stars out of 5 (Insider, 2015).

Since the overall score is submitted independently, rather than calculated from the category scores, it is interesting to see which categories affect the user’s overall rating the most. All the subcategory ratings have at least a moderate, positive relation to the overall score. The correlations between the overall score and value, check in and accuracy are the strongest, suggesting that those categories matter most for the guest’s overall satisfaction. In general, the relation between the different rating scores and price is very weak (correlations of 0.14 or below), suggesting that these indicators will have little effect on the goodness of fit. The location score, however, has a weak positive correlation (0.30) with the logarithmic price, making it an interesting indicator of the security, comfort and attractiveness of the neighbourhood.

Table 3: Correlation Analyses on Review Scores
Name Minimum Maximum Mean Correlation to Overall Score Correlation to Price
Accuracy 2 10 9 0.77 0.09
Check In 2 10 10 0.78 0.14
Cleanliness 2 10 9 0.67 0.10
Communication 4 10 10 0.68 0.10
Location 3 10 9 0.54 0.30
Value 2 10 9 0.79 0.04
Overall 20 100 92 1.00 0.13

2.5 Property Characteristics

2.5.1 Accommodates and Beds

Table 4: Results cor.Test on Capacity Indicators
Variable P-Value Conf-Int. Low Estimate Conf-Int. High
Accommodates < 0.01 0.32 0.34 0.36
Beds < 0.01 0.19 0.21 0.23

The variables accommodates (how many people can stay in the property) and beds (the number of beds in the property) give an indication of the overall capacity of the Airbnb. Both variables have a relation to the room price that is significantly different from zero. However, both correlations are weak, suggesting that even though price rises with capacity, it rises slowly.
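The confidence intervals in Table 4 are consistent with the usual Fisher z-transformation for a Pearson correlation. The sketch below assumes n = 7,020 observations and uses the rounded estimates from the table; the function name `pearson_ci` is ours.

```python
import math
from statistics import NormalDist

def pearson_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)                    # Fisher transformation
    se = 1.0 / math.sqrt(n - 3)          # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

for name, estimate in [("Accommodates", 0.34), ("Beds", 0.21)]:
    low, high = pearson_ci(estimate, 7020)
    print(f"{name}: {low:.2f} to {high:.2f}")
```

With this many observations the intervals are narrow, which is why even weak correlations are clearly distinguishable from zero.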

2.5.2 Amenities

Airbnb includes some general information on the property such as the room type, the number of people that can be accommodated or the number of bathrooms. On top of these characteristics, Airbnb contains information on a wide range of amenities for every flat. These range from the availability of Internet and a TV to a personal doorman or a pool. In order to analyse these, dummy variables for 53 different amenities, with 46 resulting in usable data, as well as a variable counting the total number of amenities were introduced.
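Creating the amenity dummies and the amenity count could look like the following sketch. It assumes that the amenities are stored as one brace-delimited, comma-separated string per listing, with quoted entries where needed; the exact amenity labels and the names `parse_amenities` and `amenity_features` are illustrative assumptions.

```python
import csv

# Amenity labels are assumptions; the spelling in the raw data may differ.
SELECTED = [
    "TV", "Dryer", "Washer", "Kitchen",
    "Elevator in Building", "Family/Kid Friendly", "Lock on Bedroom Door",
]

def parse_amenities(raw):
    """Split a string such as '{TV,"Lock on Bedroom Door"}' into a set of names."""
    inner = raw.strip().strip("{}")
    if not inner:
        return set()
    # csv.reader handles quoted entries that contain spaces or commas.
    return {item.strip() for item in next(csv.reader([inner])) if item.strip()}

def amenity_features(raw):
    """Return 0/1 dummies for the selected amenities plus the total count."""
    present = parse_amenities(raw)
    row = {name: int(name in present) for name in SELECTED}
    row["amenity_count"] = len(present)
    return row

example = '{TV,Kitchen,"Lock on Bedroom Door",Washer}'
print(amenity_features(example))
```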

To reduce the count, seven amenities were chosen for closer analysis. Priority was given to home essentials such as access to a kitchen, a lock to secure the personal space, a TV, a dryer and a washer, to facilities like an elevator and to the family-friendliness of the room. The presence of a TV, an elevator, a dryer or a washer, as well as the family-friendliness of a room, tends to have a positive impact on the flat’s price. This is especially true for the TV and the dryer. Interestingly, Airbnbs with a lock on the bedroom door seem to be slightly less valued. Perhaps a lock on the bedroom door is more commonly in place in less safe locations. Rooms with access to a kitchen are also priced marginally lower, but this difference is not significant (p = 0.52).

Table 5: Results t.Tests on selected Amenities
Amenities P-Value ln(Price) Incl. ln(Price) Excl. Difference ln(Price)
Washer 0.03 3.83 3.80 0.03
TV < 0.01 3.89 3.74 0.15
Family / Kid-Friendly < 0.01 3.87 3.79 0.08
Dryer < 0.01 3.92 3.77 0.15
Kitchen 0.52 3.82 3.83 -0.01
Elevator in Building < 0.01 3.89 3.79 0.10
Lock on Bedroom Door < 0.01 3.79 3.83 -0.04
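A comparison of mean ln(Price) between rooms with and without an amenity can be sketched with Welch's two-sample t-statistic. This is a simplified illustration, not the code behind Table 5: the p-value uses a normal approximation, which is only reasonable for large groups such as those in the dataset, and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, fmean, variance

def welch_t(sample_a, sample_b):
    """Return (mean difference, Welch t-statistic, approximate two-sided p-value)."""
    diff = fmean(sample_a) - fmean(sample_b)
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    t = diff / se
    # Normal approximation to the t distribution (assumption: large samples).
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return diff, t, p_approx

# Hypothetical ln(price) values for rooms with and without a TV.
with_tv = [3.9, 4.0, 4.1, 4.0, 3.9, 4.1]
without_tv = [3.7, 3.8, 3.6, 3.8, 3.7, 3.9]

diff, t, p = welch_t(with_tv, without_tv)
print(f"difference in ln(price): {diff:.2f}, t = {t:.2f}, p = {p:.4f}")
```

Since the outcome is on the log scale, a difference of d corresponds to a price ratio of exp(d) between the two groups.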

2.6 Attributes of the Ad

Table 6: Test Results on Ad Attributes
Attributes P-Value
Instant Bookable 0.84
Cancellation Policy < 0.01

Usually, a guest needs to submit a booking request and gets to stay in the property only if the host approves that request. To attract more customers, some hosts allow instant booking of their properties, which is similar to booking a hotel - the user just books the property straight away. In the dataset, TRUE means guests can book the desired property instantly, while FALSE means they have to get approval from the host first.

In addition to instant booking, hosts also have the right to choose their own cancellation policy. The cancellation policy determines whether guests can get a refund and how they are refunded. There are several cancellation policies from which hosts can choose, including flexible, moderate, strict and super strict. Under the flexible policy, guests may get a full refund if the reservation is cancelled within a limited period, typically 24 hours prior to check in. Under the moderate policy, fees are fully refundable but only if the booking is cancelled a longer time in advance. Under the strict policy, only 50% of fees may be refunded if the booking is cancelled more than 1 week before check in (Airbnb Inc., 2017a). While the difference in mean price between rooms with and without instant booking is insignificant, the scale version of the cancellation policy is significantly but weakly correlated with the price of the room.

3 Regression model

Table 7: Regression Results
Dependent variable:
ln(Price)
(1) (2)
Mean Rent 0.001*** (0.0001) 0.001*** (0.0001)
Distance -0.013*** (0.001) -0.013*** (0.001)
Review Score - Rating 0.007*** (0.001)
Review Score - Accuracy -0.016** (0.008)
Review Score - Check-In 0.014* (0.008)
Review Score - Cleanliness 0.040*** (0.006) 0.037*** (0.004)
Review Score - Communication 0.001 (0.009)
Review Score - Location 0.080*** (0.006) 0.072*** (0.006)
Review Score - Value -0.083*** (0.008)
Accommodates 0.160*** (0.006) 0.151*** (0.005)
Number of Beds -0.023** (0.010)
Amenity - Dryer 0.072*** (0.008) 0.062*** (0.008)
Amenity - Elevator 0.045*** (0.008) 0.043*** (0.008)
Amenity - Family friendly 0.007 (0.008)
Amenity - Lock on Bedroom Door -0.045*** (0.010) -0.049*** (0.010)
Amenity - TV 0.116*** (0.008) 0.116*** (0.008)
Amenity - Washer -0.044*** (0.010)
Instant bookable - FALSE 0.004 (0.010)
Cancellation Policy - Moderate -0.008 (0.009)
Constant 2.111*** (0.070) 2.003*** (0.054)
Observations 7,020 7,020
R2 0.433 0.420
Adjusted R2 0.432 0.420
Residual Std. Error 0.304 (df = 7000) 0.307 (df = 7010)
F Statistic 281.780*** (df = 19; 7000) 565.119*** (df = 9; 7010)
Note: *p<0.1; **p<0.05; ***p<0.01

3.1 Interpretation

As our dependent variable was transformed to its logarithmic version, a log-linear regression model is used to explain the effect of the independent variables on the dependent variable. Comparing the two versions of the model, it becomes clear that some variables are insignificant, some have a multicollinearity problem and the review score for value is likely to have an endogeneity problem, as the rating of the user is dependent on the price of the ad. Additionally, some of the amenities had a negative impact on the price, which is counterintuitive and contradicts the results of the t-test. As the effects are small and likely to be caused by random noise, such variables are excluded:

\[\begin{gather*} ln(price) = \beta_0 + \beta_1(mean\_rent) + \beta_2(distance) + \beta_3(review\_scores\_cleanliness) + \\ \beta_4(review\_scores\_location) + \beta_5(accommodates) + \beta_6(dryer) + \\ \beta_7(elevator) + \beta_8(lock) + \beta_9(TV) + u \end{gather*}\]

The model explains 42 percent of the variance of the dependent variable. The residual standard error is 0.307 on the log scale, and the F-statistic is highly significant. Thus, the model provides a far better explanation than the intercept-only model. The intercept of 2.003 corresponds to a price of exp(2.003) ≈ £7.41. It is not meaningful on its own, because no room will be advertised with a capacity of zero people. The other coefficients, multiplied by 100, show approximately by how many percent the price changes if the explanatory variable changes by one unit, holding all other independent variables constant. For example, for every additional person a room can accommodate, the price rises by about 15 percent. Capacity, the review score for location as an indicator of the attractiveness of the neighbourhood, and the presence of a TV have the largest positive effects on the price of a room. If there is a lock on the bedroom door, the effect on the price is negative. The effect of distance on room price appears small, but put into context, the price shrinks by 1.3 percent for every kilometre further from Piccadilly Circus.
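The reading of the coefficients can be made concrete with a short sketch. It uses the model (2) estimates from Table 7, taking 2.003 as the intercept (the value whose exponential gives the £7.41 quoted above). The exact percentage effect of a coefficient β is exp(β) − 1, which is close to β itself for small values. The example room and the function names are hypothetical.

```python
import math

# Estimates of model (2) as reported in Table 7.
INTERCEPT = 2.003
COEFFICIENTS = {
    "mean_rent": 0.001,
    "distance": -0.013,
    "review_scores_cleanliness": 0.037,
    "review_scores_location": 0.072,
    "accommodates": 0.151,
    "dryer": 0.062,
    "elevator": 0.043,
    "lock": -0.049,
    "tv": 0.116,
}

def percent_effect(beta):
    """Exact percentage change in price for a one-unit increase in a regressor."""
    return (math.exp(beta) - 1) * 100

def predict_price(features):
    """Predicted nightly price in GBP; regressors not given are taken as zero."""
    log_price = INTERCEPT + sum(COEFFICIENTS[k] * v for k, v in features.items())
    return math.exp(log_price)

print(f"intercept as a price: {math.exp(INTERCEPT):.2f}")
print(f"one more guest: {percent_effect(COEFFICIENTS['accommodates']):+.1f}%")
print(f"one km further out: {percent_effect(COEFFICIENTS['distance']):+.2f}%")

# Hypothetical room: weekly rent of 300 in the area, 5 km from the centre,
# review scores of 9, room for two guests, with a TV.
room = {
    "mean_rent": 300, "distance": 5,
    "review_scores_cleanliness": 9, "review_scores_location": 9,
    "accommodates": 2, "tv": 1,
}
print(f"predicted price: {predict_price(room):.2f}")
```

For the capacity coefficient the exact effect (about 16 percent) is slightly larger than the approximation of 15 percent, while for small coefficients such as distance the two readings practically coincide. Note that exponentiating a predicted log price gives a value closer to the median than to the mean of the price distribution.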

3.2 Fitting the model

3.2.1 Residuals

Table 8: Residuals
Group Count ln(Price) Price
Residuals >0.3 1018 4.37 85.05
Residuals <=0.3 6002 3.73 44.14

A short exploration of the residuals shows that the mean price of observations with residuals above 0.3 is considerably higher than that of the other observations (£85.05 compared with £44.14). As the scatter plot in Figure 5 illustrates, the model is less accurate at explaining more expensive rooms: the chosen factors do not fully explain the difference in price. A Durbin-Watson test on the regression model shows that the error terms are uncorrelated.
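The Durbin-Watson statistic itself is simple to compute from the ordered residuals. The sketch below shows the statistic only, not the accompanying significance test, and the residual sequences are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: squared successive differences over squared residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Hypothetical residual sequences for illustration.
alternating = [1.0, -1.0, 1.0, -1.0]   # strong negative autocorrelation
persistent = [1.0, 1.0, 1.0, 1.0]      # strong positive autocorrelation

print(durbin_watson(alternating))
print(durbin_watson(persistent))
```

The statistic lies between 0 and 4. Values near 2 indicate no first-order autocorrelation, values near 0 positive and values near 4 negative autocorrelation. For cross-sectional data such as this, the result depends on the order of the observations.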

Figure 5: Residuals and ln(price)

3.2.2 Sensitivity to outliers

As in any regression model based on ordinary least squares, the coefficients in our model are affected by outliers. Some of the properties in our dataset cost more than £400 per night, while most of them cost below £100. The outliers may disproportionately affect our coefficients, making them less accurate for the remaining observations. However, as we consider it bad practice to exclude individual observations from the regression, none of them were removed or otherwise treated.

3.2.3 Multicollinearity

Table 9: Results Multicollinearity Test
VIF
Mean Rent 1.76
Distance 1.68
Review Score - Cleanliness 1.28
Review Score - Location 1.37
Accommodates 1.03
Amenity - Dryer 1.05
Amenity - Elevator 1.02
Amenity - Lock on Bedroom 1.03
Amenity - TV 1.07

The most strongly correlated variables were already excluded from the regression, as such relations inflate the standard errors of the coefficient estimates. A VIF of four implies that the variance of a coefficient estimate is four times higher than it would be if the independent variables were uncorrelated. Usually, a VIF greater than 3 is considered critical to the model results. None of the variables used reaches that threshold.
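The VIF of a regressor is defined as 1 / (1 − R²), where R² comes from regressing that variable on all other independent variables. A small sketch of this relation, applied to the largest value in Table 9 (function names are ours):

```python
import math

def vif_from_r_squared(r_squared):
    """Variance inflation factor implied by the auxiliary-regression R²."""
    return 1.0 / (1.0 - r_squared)

def r_squared_from_vif(vif):
    """Auxiliary-regression R² implied by a given VIF."""
    return 1.0 - 1.0 / vif

print(vif_from_r_squared(0.75))                      # a VIF of four
r2_mean_rent = r_squared_from_vif(1.76)              # mean rent, Table 9
print(f"implied R² for mean rent: {r2_mean_rent:.3f}")
print(f"standard error inflation: {math.sqrt(1.76):.2f}x")
```

Read this way, the highest VIF in the model (1.76 for mean rent) means that about 43 percent of the variation in mean rent is explained by the other regressors, which inflates the standard error of its coefficient by roughly a third.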

3.2.4 Omitted variable bias

The price of an Airbnb is affected by a large number of factors. The presented model includes some of them, but it was not feasible or possible to include data concerning every single possible determinant. As a result, the model likely suffers from omitted variable bias. It under- or overestimates the effect of some of the existing factors to compensate for the missing information, making the model less reliable.

3.2.5 Lack of clustering

By putting all properties into one model, we ignore the fact that there might be different profiles of properties and for each profile, different characteristics might be relatively more important. Perhaps there is a set of properties that are popular with students coming to London for graduate job interviews, who would see location close to the financial centers and low price as important factors. And, perhaps, different types of properties are popular with middle-aged tourists - then the proximity to the popular sights and the level of comfort provided might matter more. If we divided our properties into clusters which share similar characteristics, and then ran a regression analysis for each cluster, we might get a more accurate model for each cluster.

3.2.6 Limitations

The dataset does not contain several important variables, such as the size of the room, the proximity of the flat to a tube station, the age of the flat or the quality of the equipment and furniture. Additionally, the attractiveness of the room and of the building it is in could not be quantified. As this attractiveness differs across buildings and sometimes even within a building, the model cannot predict the price of an apartment that exceeds the expectations set by the explanatory variables used in the regression.

4 Conclusion

The results of this paper have direct implications for understanding the pricing of Airbnbs in London. Firstly, most apartments are priced below £100, with some highly expensive exceptions. Secondly, location is the predominant factor in the price. Both mean rent, which accounts for the attractiveness of the neighbourhood, and the distance to the city centre have a strong impact in the regression model. So, when on a tight budget, compromising on location can be recommended.
On the other hand, only some of the farther-reaching factors showed a significant impact. Reviews tend to be very high and intercorrelated for all apartments. Therefore, only cleanliness and location reviews made it into the final model. Many amenities important for someone searching for a room, like WiFi and access to a properly equipped kitchen, have small effects on the room price, as they can be found in most London apartments. In contrast, luxury amenities like an elevator, a TV and a dryer create costs for the host and thus increase the price.
Looking at these different attributes of an Airbnb ad, the user is able to determine whether the price of the apartment is actually fair, which was the aim of this report.

This paper could be improved through several extensions. Machine learning methods applied on the photos of the room could provide a more objective measure of its attractiveness. Moreover, many other types of data could be collected. For example, the age of the flat or the crime rate in the neighbourhood could provide more information not currently present in the dataset. It could also be explored how Airbnb prices vary in time – whether rooms booked last-minute tend to be more expensive, and whether the prices are affected by seasonality.

Bibliography

Airbnb Inc. (2017a) Cancellation policies. [Online]. Available from: https://www.airbnb.co.uk/home/cancellation_policies#strict.

Airbnb Inc. (2017b) How do star ratings work. [Online]. Available from: https://de.airbnb.com/help/article/1257/how-do-star-ratings-work.

Cox, J. (2017a) Airbnb: Surge in UK hosts over past year boosts local economies. The Independent. [Online]. Available from: http://www.independent.co.uk/news/business/news/airbnb-hosts-uk-surge-boost-local-economies-online-holiday-rental-london-southwest-northern-ireland-a7940451.html.

Cox, M. (2017b) Inside airbnb - adding data to the debate. [Online]. Available from: http://data.insideairbnb.com/united-kingdom/england/london/2017-03-04/data/listings.csv.gz.

Latlong.net (2017) Get latitude and longitude. [Online]. Available from: https://www.latlong.net.

Lokku Ltd. (2017) London house prices by postcode. [Online]. Available from: https://www.findproperly.co.uk/london/postcode/#.WdvonHeZNn4.

Reid, M. (2011) Haversine formula. [Online]. Available from: http://wordpress.mrreid.org/2011/12/20/haversine-formula/.

Furthermore, for plotting our observations on a ggmap, we consulted the following sources:

Irawan, D.E. (2014) How to convert lat-long coordinates to utm. [Online]. Available from: https://rpubs.com/dasaptaerwin/19879.

Lovelace, R. & Cheshire, J. (2014) Introduction to visualising spatial data in R. National Centre for Research Methods Working Papers. [Online] 14 (03). Available from: https://github.com/Robinlovelace/Creating-maps-in-R.

The header photo was downloaded from Pexels and is licence free. Available from: https://www.pexels.com/photo/architecture-buildings-business-capital-417382/

Imperial College Business School